Skip to content

fix(reborn): remove per-record lock convoys via shared cas_update helper#5234

Open
henrypark133 wants to merge 16 commits into
mainfrom
fix/reborn-p1-cas-helper
Open

fix(reborn): remove per-record lock convoys via shared cas_update helper#5234
henrypark133 wants to merge 16 commits into
mainfrom
fix/reborn-p1-cas-helper

Conversation

@henrypark133

Copy link
Copy Markdown
Collaborator

Problem

Each Reborn persistence store wrapped its filesystem read-modify-write in a per-record tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) held across .await — a redundant in-process serializer over backends that already do versioned CAS. Under burst, one writer stalled inside its critical section blocks every other writer for that scope (the convoy that contributed to the runtime wedge). PR #5142 removed exactly this from ironclaw_turns; this PR finishes the job for the remaining stores via one shared helper.

Change

Shared helper (ironclaw_filesystem::cas_update): one mutex-free, bounded read-modify-write loop — read versioned snapshot → idempotent apply closure → CAS put → on VersionMismatch re-read and retry (32×, jittered 2–50ms backoff, 15s timeout), with a fail-closed capability gate. Generic over record/outcome/error; leaks no store types.

Migrated all 5 stores onto it and deleted every per-record mutex:

  • ironclaw_turns — re-homed its local fix(turns): prevent turn-state write convoy #5142 copy onto the shared helper (one owner).
  • ironclaw_run_state, ironclaw_threads — dropped the mutex; route through the helper.
  • ironclaw_resourcesgains a retry loop it never had (the lock was its only serializer).
  • ironclaw_secrets — deletes the Arc-keyed lock map → fixes an unbounded memory leak (it stored Arc, never pruned).

Post-condition: grep FILESYSTEM_RECORD_LOCKS / record_lock.lock().await / filesystem_secret_lock across the 5 store crates is empty.

Guardrail: .claude/rules/database.md records the invariant — filesystem read-modify-write must go through cas_update; never wrap it in a per-record mutex held across .await.

Tri-backend parity (no diverging experience by host)

The helper is RootFilesystem-level (backend-agnostic). Verified across all three ironclaw_filesystem backends:

  • in-memory (default): clippy 0 across the 6 crates; all store tests pass (turns/run_state/threads/resources/secrets).
  • postgres (--features postgres): CAS contract tests pass — postgres_native_put_cas_absent_rejects_existing_path, postgres_native_put_cas_any_increments_existing_version, postgres_transaction_rollback_discards_prior_put_after_later_cas_conflict.
  • libsql (--features libsql): CAS contract tests pass. (One unrelated append/dir contract test is flaky under parallel suite execution — passes 3/3 in isolation; db.rs is untouched by this PR, so it is pre-existing test-infra flakiness, not a regression.)

Notes

🤖 Generated with Claude Code

henrypark133 and others added 7 commits June 24, 2026 19:01
Extract the proven mutex-free CAS pattern from ironclaw_turns (PR #5142)
into one shared helper in ironclaw_filesystem so the remaining stores can
drop their per-record tokio::sync::Mutex convoys.

cas_update runs a bounded read-modify-write loop: read versioned snapshot,
run the caller's idempotent apply closure, CAS-put with the read version,
and on VersionMismatch re-read and retry with jittered exponential backoff
(2ms..50ms), capped at 32 retries and wrapped in a 15s timeout. A
fail-closed capability gate rejects backends that cannot CAS rather than
silently blind-overwriting. The helper is generic over the record type,
the outcome, and the caller's error; it never leaks store-specific types.

No store migrated yet — that follows in the next commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…te helper

ironclaw_resources had NO CAS-retry loop — its per-record tokio::sync::Mutex
(FILESYSTEM_RECORD_LOCKS) held across the backend get/put awaits was the ONLY
serializer. Under burst, one writer stalled inside its critical section parked
every other same-scope writer (the convoy that contributed to the 2026-06-24
runtime wedge). Naively deleting that mutex without a retry loop loses updates:
a racing writer's single-attempt CAS detects the version mismatch and errors
out (cross-process CAS contention) instead of retrying.

Route update_snapshot through ironclaw_filesystem::cas_update (bounded CAS
retry + jittered backoff + 15s timeout + fail-closed capability gate) and
delete the per-record mutex, FilesystemRecordLock, the local PutError, and the
fail-open put_with_cas (CasExpectation::Any blind-overwrite fallback). The
helper fails closed instead; verified safe — these store aliases only ever
resolve to CAS-capable db/in-memory backends in production.

The public update API widens from FnOnce to FnMut because the helper re-runs
the closure against a freshly read snapshot on every CAS retry; leaf closures
that previously moved captured values now clone per invocation. Add PartialEq
to BudgetGateSnapshot (helper needs S: Clone + PartialEq).

Red->green regression: cas_snapshot::tests::concurrent_increments_have_no_lost_updates
proves RED on lock-removed-no-retry (lost updates / spurious contention) and
GREEN through the helper (every concurrent increment lands, no convoy).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ant per-record mutex

run_state held BOTH a store-local CAS-retry loop AND a per-record
tokio::sync::Mutex (FILESYSTEM_RECORD_LOCKS) across the backend get/put awaits
— belt-and-suspenders where the mutex was a redundant in-process serializer
over a backend that already does versioned CAS. Under burst the mutex convoyed
same-scope writers behind one stalled writer (the 2026-06-24 wedge pattern).

Route apply_update, update_status, start, and save_pending through
ironclaw_filesystem::cas_update (one shared CAS implementation: bounded retry +
jittered backoff + 15s timeout + fail-closed capability gate). Delete the
store-local FILESYSTEM_CAS_RETRIES loop, the per-record lock + accessor + its
two unit tests, the local PutError, and the fail-open put_with_cas
(CasExpectation::Any blind-overwrite fallback). The scope-ownership check and
the approval Pending guard move inside the re-runnable apply closure; the
ApprovalStatus guard returns Err from apply. discard_pending keeps its
read-then-delete, only the lock is dropped.

The run_state contract tests previously ran against LocalFilesystem (byte-only,
no versioned CAS) where the OLD fail-open fallback masked the missing CAS.
Production never routes run_state to LocalFilesystem — only to CAS-capable
db/in-memory backends — so the tests now use InMemoryBackend, matching the
production capability shape (CAS) instead of a non-production blind-overwrite path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p per-record mutex

ensure_thread held the only per-record tokio::sync::Mutex
(FILESYSTEM_RECORD_LOCKS) in this crate, across the backend get/put awaits, to
serialize the check-then-create against Unsupported(WriteFile)-fallback
backends. Route it through ironclaw_filesystem::cas_update: the apply closure
returns the existing record unchanged (no-op, no write) when the thread already
exists with a matching scope, or builds the fresh StoredThreadRecord when
absent. A concurrent create-if-absent loser hits VersionMismatch, the helper
re-reads and re-runs apply which now sees the winner and reconciles scope —
exactly the old single-reconcile semantics, now via the shared CAS-retry loop
without any lock. Delete the lock infra (Weak map + accessor); add PartialEq to
StoredThreadRecord (helper needs S: Clone + PartialEq). The three other
record/txn loops were already lock-free and are left as-is.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cord mutex (Arc leak fix)

The secrets filesystem store kept FILESYSTEM_RECORD_LOCKS as a HashMap of
*strong* Arc<Mutex<()>> (not Weak), one entry per (scope,lease)/session path
ever seen, NEVER pruned — an unbounded memory leak. The per-record mutex was
also held across the backend get/put awaits, convoying same-key writers behind
one stalled writer (the 2026-06-24 wedge pattern).

Route consume/revoke/consume_session_use through ironclaw_filesystem::cas_update
(one shared CAS implementation), mapping the local CasDecision onto CasApply:
Commit/BestEffortCommit -> changed snapshot; Settle(Ok) -> unchanged snapshot
(helper skips the write via PartialEq no-op); Settle(Err)/not-found -> apply
error. Crypto decrypt + use-count increment run inside the re-runnable apply
closure (pure / recomputed from the freshly read record each retry).
validate_session is a pure read — its lock is simply deleted. Delete the entire
lock map (leak gone), cas_mutate/CasDecision/CAS_RETRY_ATTEMPTS, and the
fail-open put_with_version_fallback (helper fails closed). Add PartialEq to
StoredLease and StoredSession.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR #5142 removed the per-record mutex from ironclaw_turns but left a local copy
of the CAS read-modify-write loop (apply_with_retry, put_with_cas, PutError,
cas_retry_backoff + constants). Re-home it onto ironclaw_filesystem::cas_update
so there is ONE CAS implementation across the codebase. Behavior is preserved:
the 18 call sites and their closure shape (FnMut(InMemoryTurnStateStore) -> Fut)
are unchanged; the exact timeout/exhaustion error strings are preserved.

A BridgeError<T> sentinel carries the absent-record + default-snapshot no-op
through the helper's apply-error channel (the helper's own no-op check only
fires for Some(existing)==new; the turns store additionally must skip creating a
file for an empty default store, which the old new==old check covered because
read returned default on absent). The 500ms SNAPSHOT_READ_CACHE_TTL read cache
is kept as a separate layer: cleared around the CAS call and repopulated on the
next read (the helper does its own fresh get and does not surface the new
RecordVersion; every read_snapshot caller already discards the version, so
caching None is safe). Delete the now-unused local loop, put_with_cas, PutError,
cas_retry_backoff, and the duplicate CAS constants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cord mutex

Add the invariant to crates/ironclaw_filesystem/CLAUDE.md (#2 "CAS is the
floor") and a pointer in .claude/rules/database.md: every filesystem
read-modify-write must go through the one shared ironclaw_filesystem::cas_update
helper; never wrap it in a per-record tokio::sync::Mutex held across the backend
.await (redundant serializer + convoy/wedge risk + leak). Also commit Cargo.lock
(async-trait dev-dep added to ironclaw_resources for the convoy regression test).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 25, 2026 11:53 Destroyed
@github-actions github-actions Bot added scope: docs Documentation scope: dependencies Dependency updates size: XL 500+ changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Jun 25, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request migrates multiple persistence stores (ironclaw_resources, ironclaw_run_state, ironclaw_threads, ironclaw_secrets, and ironclaw_turns) to use a shared, lock-free compare-and-swap (CAS) helper (cas_update). This eliminates redundant per-record mutexes held across .await boundaries, preventing runtime-wedging convoys under high contention. Feedback on the changes highlights an opportunity to improve the jitter calculation in cas_retry_backoff using RandomState to avoid zero-jitter scenarios in low clock resolution environments (such as VMs or Windows). Additionally, it is recommended to explicitly add Send bounds to the closure generic constraints in cas_update and cas_update_loop to catch any non-Send regressions at the helper's definition site.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +311 to +325
async fn cas_retry_backoff(attempt: usize) {
let shift = attempt.min(8) as u32;
let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);
let base_delay = FILESYSTEM_CAS_BACKOFF_BASE
.saturating_mul(multiplier)
.min(FILESYSTEM_CAS_BACKOFF_MAX);
let jitter = SystemTime::now()
.duration_since(UNIX_EPOCH)
.map(|elapsed| {
let jitter_ceiling = base_delay.as_millis().max(1);
Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64)
})
.unwrap_or_default();
tokio::time::sleep(base_delay.saturating_add(jitter)).await;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using SystemTime::now() for jitter calculation can lead to zero jitter on systems with low clock resolution (e.g., 1ms resolution in VMs, containers, or Windows). Since base_delay.as_millis() is at most 50, and 1,000,000 (1ms in nanoseconds) is divisible by all common backoff ceilings (2, 4, 8, 16, 32, 50), the modulo operation elapsed.as_nanos() % jitter_ceiling will always evaluate to exactly 0. This completely defeats the purpose of jitter under high contention.

Instead, use std::collections::hash_map::RandomState to obtain a high-quality, thread-safe, and un-correlated pseudo-random value seeded from the OS without adding external dependencies.

Suggested change
async fn cas_retry_backoff(attempt: usize) {
let shift = attempt.min(8) as u32;
let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);
let base_delay = FILESYSTEM_CAS_BACKOFF_BASE
.saturating_mul(multiplier)
.min(FILESYSTEM_CAS_BACKOFF_MAX);
let jitter = SystemTime::now()
.duration_since(UNIX_EPOCH)
.map(|elapsed| {
let jitter_ceiling = base_delay.as_millis().max(1);
Duration::from_millis((elapsed.as_nanos() % jitter_ceiling) as u64)
})
.unwrap_or_default();
tokio::time::sleep(base_delay.saturating_add(jitter)).await;
}
async fn cas_retry_backoff(attempt: usize) {
let shift = attempt.min(8) as u32;
let multiplier = 1_u32.checked_shl(shift).unwrap_or(u32::MAX);
let base_delay = FILESYSTEM_CAS_BACKOFF_BASE
.saturating_mul(multiplier)
.min(FILESYSTEM_CAS_BACKOFF_MAX);
let jitter = {
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};
let mut hasher = RandomState::new().build_hasher();
attempt.hash(&mut hasher);
let hash = hasher.finish();
let jitter_ceiling = base_delay.as_millis().max(1) as u64;
Duration::from_millis(hash % jitter_ceiling)
};
tokio::time::sleep(base_delay.saturating_add(jitter)).await;
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0eaf69d — jitter now derives from a RandomState-seeded hash (clock-independent), so it no longer collapses to 0 on coarse-resolution clocks. See cas_retry_backoff in crates/ironclaw_filesystem/src/cas.rs.

Comment on lines +194 to +208
pub async fn cas_update<F, S, T, E, D, N, A, Fut>(
filesystem: &ScopedFilesystem<F>,
scope: &ResourceScope,
path: &ScopedPath,
decode: D,
encode: N,
mut apply: A,
) -> Result<T, CasUpdateError<E>>
where
F: RootFilesystem + ?Sized,
S: PartialEq + Clone,
D: Fn(&[u8]) -> Result<S, E>,
N: Fn(&S) -> Result<Entry, E>,
A: FnMut(Option<S>) -> Fut,
Fut: Future<Output = Result<CasApply<S, T>, E>>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.

pub async fn cas_update<F, S, T, E, D, N, A, Fut>(
    filesystem: &ScopedFilesystem<F>,
    scope: &ResourceScope,
    path: &ScopedPath,
    decode: D,
    encode: N,
    mut apply: A,
) -> Result<T, CasUpdateError<E>>
where
    F: RootFilesystem + ?Sized,
    S: PartialEq + Clone + Send,
    T: Send,
    E: Send,
    D: Fn(&[u8]) -> Result<S, E> + Send,
    N: Fn(&S) -> Result<Entry, E> + Send,
    A: FnMut(Option<S>) -> Fut + Send,
    Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,
References
  1. In Rust, explicitly add Send bounds to closure generic constraints in async helper functions that return futures intended to be Send. This ensures that non-Send regressions are caught at the helper's definition site rather than at the call site, even if the future is already bound by Send at the call site.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considered but declined. Adding Send to the helper generics forces propagating Send across consumer-crate generics (run_state mutate: M, turns persistence T) — threading a bound through multiple layers for marginal "catch at the def site" benefit, which .claude/rules/architecture.md flags as bound-creep. The futures are already Send-constrained where awaited/spawned.

Comment on lines +226 to +240
async fn cas_update_loop<F, S, T, E, D, N, A, Fut>(
filesystem: &ScopedFilesystem<F>,
scope: &ResourceScope,
path: &ScopedPath,
decode: &D,
encode: &N,
apply: &mut A,
) -> Result<T, CasUpdateError<E>>
where
F: RootFilesystem + ?Sized,
S: PartialEq + Clone,
D: Fn(&[u8]) -> Result<S, E>,
N: Fn(&S) -> Result<Entry, E>,
A: FnMut(Option<S>) -> Fut,
Fut: Future<Output = Result<CasApply<S, T>, E>>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to the repository's general rules, async helper functions that return futures intended to be Send should explicitly add Send bounds to their closure generic constraints. This ensures that any non-Send regressions are caught at the helper's definition site rather than at the call site.

async fn cas_update_loop<F, S, T, E, D, N, A, Fut>(
    filesystem: &ScopedFilesystem<F>,
    scope: &ResourceScope,
    path: &ScopedPath,
    decode: &D,
    encode: &N,
    apply: &mut A,
) -> Result<T, CasUpdateError<E>>
where
    F: RootFilesystem + ?Sized,
    S: PartialEq + Clone + Send,
    T: Send,
    E: Send,
    D: Fn(&[u8]) -> Result<S, E> + Send,
    N: Fn(&S) -> Result<Entry, E> + Send,
    A: FnMut(Option<S>) -> Fut + Send,
    Fut: Future<Output = Result<CasApply<S, T>, E>> + Send,
References
  1. In Rust, explicitly add Send bounds to closure generic constraints in async helper functions that return futures intended to be Send. This ensures that non-Send regressions are caught at the helper's definition site rather than at the call site, even if the future is already bound by Send at the call site.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same decision as the cas_update bound above — declined to avoid propagating Send across consumer-crate generics for marginal benefit.

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Filesystem-backed read-modify-write paths now use cas_update with fail-closed gating, bounded retries, and timeout handling. Run-state, resources, secrets, threads, and turns were migrated off local lock/CAS loops, and discarded approval-state handling was updated.

Changes

Shared filesystem CAS migration

Layer / File(s) Summary
Contract and public API surface
.claude/rules/database.md, crates/ironclaw_filesystem/CLAUDE.md, docs/plans/2026-06-25-cas-migration.md, crates/ironclaw_filesystem/src/scoped.rs, crates/ironclaw_filesystem/src/lib.rs, crates/ironclaw_filesystem/Cargo.toml, crates/ironclaw_resources/Cargo.toml, crates/ironclaw_run_state/tests/approval_resolution_contract.rs, crates/ironclaw_capabilities/src/helpers.rs
Filesystem invariants and migration notes now require cas_update; ScopedFilesystem exposes capabilities; the crate root exports the CAS helper, types, and timing constants; and discarded approval-state mapping is added at the capability layer.
cas_update loop and tests
crates/ironclaw_filesystem/src/cas.rs, crates/ironclaw_filesystem/src/cas/tests.rs
The shared helper performs capability-gated retries, timeout, backoff, and mapped error handling, with behavior covered by tests.
CasSnapshotStore and resource retries
crates/ironclaw_resources/src/cas_snapshot.rs, crates/ironclaw_resources/src/filesystem_store.rs, crates/ironclaw_resources/src/lib.rs
CasSnapshotStore, resource-governor updates, and budget-gate updates switch to shared CAS retries and re-runnable closures, and the affected snapshot types derive PartialEq.
Run-state record updates
crates/ironclaw_run_state/src/lib.rs, crates/ironclaw_run_state/tests/*, crates/ironclaw_product_workflow/src/approval_interaction/service.rs
FilesystemRunStateStore and approval-request storage move creation, mutation, discard tombstones, and record filtering onto cas_update while removing the local lock map and legacy CAS write helpers; discarded-state handling and tests align to the new status.
Secret, broker, and thread CAS updates
crates/ironclaw_secrets/src/filesystem_store.rs, crates/ironclaw_threads/src/filesystem_service.rs, crates/ironclaw_host_runtime/tests/reborn_durable_restart_integration.rs
Secret, credential-broker, and thread persistence paths now use shared CAS retries, with updated derives, record kinds, error mapping, and in-memory backend wiring for durable test coverage.
Turn-state CAS update and cache handling
crates/ironclaw_turns/src/filesystem_store.rs, crates/ironclaw_turns/src/filesystem_store/tests.rs, crates/ironclaw_turns/tests/filesystem_turn_state_contract.rs, tests/support/reborn/*
Turn-state writes route through cas_update with cache clearing, overlay preservation, and no-op handling, and the regression tests and harness wiring cover the resulting cache/version behavior.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Possibly related PRs

  • nearai/ironclaw#5002: Direct overlap in crates/ironclaw_threads/src/filesystem_service.rs, including thread record persistence and update flow changes.
  • nearai/ironclaw#5142: Same turn-state snapshot write path migration to bounded CAS-style retry behavior in crates/ironclaw_turns/src/filesystem_store.rs.
  • nearai/ironclaw#5232: Adjacent ironclaw_turns runner-lease sidecar and CAS/heartbeat overlay behavior.

Suggested reviewers

  • think-in-universe

Poem

A rabbit hopped through retry light,
Where CAS kept each old write in sight.
Discarded gates went soft and still,
While snapshots danced with bounded will.
🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description covers the change, but it omits most required template sections like change type, linked issue, validation, rollback, and review track. Reformat it to the repo template and add the missing sections: Summary, Change Type, Linked Issue, Validation, Security Impact, Trust-Boundary Checklist, Database Impact, Blast Radius, Rollback Plan, Review Follow-Through, and Review track.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: replacing per-record lock convoys with the shared cas_update helper.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

self.filesystem.delete(scope, &path).await?;

P1 Badge Guard discard deletes with CAS

When discard_pending races in the same process with approve or deny for the same request, this unconditional delete can now run after the resolver's CAS-protected status update and remove the terminal approval record. The removed per-record lock used to serialize that get/check/delete sequence with status updates; without either that lock or a CAS/tombstone transition here, a user approval can be lost and later look like UnknownApprovalRequest rather than Approved.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/ironclaw_run_state/src/lib.rs (1)

922-932: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Make discard_pending atomic before dropping the lock.

This path reads Pending, then deletes later with no CAS expectation. A concurrent approve()/deny() can win the CAS update between Line 922 and Line 932, then this delete removes a non-pending approval record. Use a CAS-aware delete/transactional transition, or model discard as a CAS status update instead of read-then-delete. As per path instructions: “Fail loud” and filesystem read-modify-write paths must use the CAS invariant rather than split multi-step state changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_run_state/src/lib.rs` around lines 922 - 932, The
discard_pending flow in RunState::discard_pending is currently split into a
read/validate step and a later filesystem delete, which can race with
approve()/deny() and remove a record after its status has already changed.
Update this path to be atomic by using a CAS-aware transition or transactional
delete that enforces the same status invariant at the point of mutation, rather
than relying on a prior Pending check; keep the failure path loud if the
expected state no longer matches.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_filesystem/src/cas.rs`:
- Line 193: The new Clippy allow on cas.rs needs the required arch-exempt
rationale comment immediately above #[allow(clippy::too_many_arguments)]. Update
the same site near the cas:: function signature that triggered the lint by
adding the mandatory // arch-exempt: too_many_args, <reason naming the missing
aggregation>, plan `#NNNN` annotation, and make sure the reason justifies why this
many parameters is unavoidable until the missing aggregation is introduced.
- Around line 295-306: The CAS preflight is treating
BackendCapabilities::default() as “unknown,” which lets a backend with no
advertised capabilities bypass the fail-closed gate. Update capabilities_known()
in cas.rs to only treat an explicit unknown state as unknown and not overload
BackendCapabilities::empty()/default(), then make the cas_update path rely on
that check so unsupported backends still return CasUpdateError::CasUnsupported
instead of falling back to op-time behavior.
- Around line 262-265: The no-op shortcut in cas_update_loop currently only
returns early when current is Some(existing) and matches the snapshot, while
absent records still fall through to encoding and writing; update the logic to
either treat None plus an unchanged/default snapshot as a true no-op or adjust
the CasApply/cas_update contract and surrounding comments to state the shortcut
only applies to existing records. Use the existing cas_update_loop and CasApply
symbols to keep the behavior and docs aligned, and remove any need for
downstream NoOp suppression if you choose to enforce it here.

In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 184-399: Add a regression test in cas/tests.rs for the timeout
path in cas_update, since the current tests cover retries, CAS rejection, and
apply errors but never exercise CasUpdateError::Timeout. Introduce a tiny wedged
backend or stub that blocks one of the CAS operations (get, put, or apply), and
use paused Tokio time so the outer timeout in cas_update deterministically
fires. Reference cas_update and CasUpdateError::Timeout in the new test, and
assert that the timeout branch wins over the hung backend behavior.

In `@crates/ironclaw_resources/src/cas_snapshot.rs`:
- Line 212: The CasUpdateError::Backend handling in cas_snapshot.rs currently
forwards the backend error string directly via E::storage_from(inner), which can
leak virtual paths and raw backend details into StorageError. Update this
mapping to return a stable sanitized public message from the error conversion
path, and preserve the detailed backend error only in internal diagnostics or
source context associated with E::storage_from.

In `@crates/ironclaw_run_state/src/lib.rs`:
- Line 1109: The `CasUpdateError::Backend` branch in `RunStateError` currently
converts the filesystem failure to a string, which drops the structured
`FilesystemError` context. Update this match arm in `lib.rs` to preserve the
typed error by using the existing `?`-style conversion path or direct typed
`From`/`Into` conversion used elsewhere in this file, so the `FilesystemError`
variant and its path/operation details continue to propagate through
`RunStateError::Filesystem`.

In `@crates/ironclaw_secrets/src/filesystem_store.rs`:
- Around line 479-483: The no-op CAS comment in filesystem_store.rs is
inaccurate: the `already_marked` branch in the lease update path returns
`Err(SecretStoreError::LeaseExpired { lease_id })` and goes through
`CasUpdateError::Apply`, not the unchanged record/`PartialEq` skip path. Update
the comment to describe the actual intent, or change the branch in the relevant
lease/CAS helper to return `CasApply::new(lease, Err(...))` if that is the
desired behavior; use the `already_marked`, `SecretLeaseStatus::Expired`, and
`CasUpdateError::Apply` symbols to locate and align the code with the documented
contract.

In `@crates/ironclaw_turns/src/filesystem_store.rs`:
- Around line 378-385: The CAS error mapper in map_cas_error should not panic on
BridgeError::NoOp, since this is production error-handling code. Replace the
unreachable! fallback in map_cas_error with a typed unavailable TurnError that
preserves context about the unexpected NoOp leak, while keeping the
BridgeError::Real(inner) path unchanged and consistent with CasUpdateError
handling.

In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 137-140: The quality gate in the migration plan is outdated and
should match the repo’s required backend validation for persistence changes.
Update the “Quality gate” section to include the feature-isolation `cargo check`
matrix for the default/postgres build, `--no-default-features --features
libsql`, and `--all-features`, and strengthen the clippy step to the full `cargo
clippy --all --benches --tests --examples --all-features -- -D warnings`. Keep
the existing formatting/tests coverage, but ensure the plan explicitly reflects
the required checks for the legacy persistence migration.

---

Outside diff comments:
In `@crates/ironclaw_run_state/src/lib.rs`:
- Around line 922-932: The discard_pending flow in RunState::discard_pending is
currently split into a read/validate step and a later filesystem delete, which
can race with approve()/deny() and remove a record after its status has already
changed. Update this path to be atomic by using a CAS-aware transition or
transactional delete that enforces the same status invariant at the point of
mutation, rather than relying on a prior Pending check; keep the failure path
loud if the expected state no longer matches.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 1c594037-4929-451d-97ad-73e096ed325d

📥 Commits

Reviewing files that changed from the base of the PR and between 44f063d and 3223540.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock, !**/Cargo.lock
📒 Files selected for processing (17)
  • .claude/rules/database.md
  • crates/ironclaw_filesystem/CLAUDE.md
  • crates/ironclaw_filesystem/src/cas.rs
  • crates/ironclaw_filesystem/src/cas/tests.rs
  • crates/ironclaw_filesystem/src/lib.rs
  • crates/ironclaw_filesystem/src/scoped.rs
  • crates/ironclaw_resources/Cargo.toml
  • crates/ironclaw_resources/src/cas_snapshot.rs
  • crates/ironclaw_resources/src/filesystem_store.rs
  • crates/ironclaw_resources/src/lib.rs
  • crates/ironclaw_run_state/src/lib.rs
  • crates/ironclaw_run_state/tests/approval_resolution_contract.rs
  • crates/ironclaw_run_state/tests/run_state_contract.rs
  • crates/ironclaw_secrets/src/filesystem_store.rs
  • crates/ironclaw_threads/src/filesystem_service.rs
  • crates/ironclaw_turns/src/filesystem_store.rs
  • docs/plans/2026-06-25-cas-migration.md

Comment thread crates/ironclaw_filesystem/src/cas.rs Outdated
Comment thread crates/ironclaw_filesystem/src/cas.rs Outdated
Comment thread crates/ironclaw_filesystem/src/cas.rs
Comment thread crates/ironclaw_filesystem/src/cas/tests.rs
Comment thread crates/ironclaw_resources/src/cas_snapshot.rs
Comment thread crates/ironclaw_run_state/src/lib.rs
Comment thread crates/ironclaw_secrets/src/filesystem_store.rs
Comment thread crates/ironclaw_turns/src/filesystem_store.rs Outdated
Comment thread docs/plans/2026-06-25-cas-migration.md Outdated
@railway-app

railway-app Bot commented Jun 25, 2026

Copy link
Copy Markdown

🚅 Deployed to the ironclaw-pr-5234 environment in ironclaw-ci-preview

Service Status Web Updated (UTC)
ironclaw ✅ Success (View Logs) Web Jun 26, 2026 at 4:59 pm

serrrfirat added a commit that referenced this pull request Jun 25, 2026
Fold in PR review (henrypark133) + a durable-write hot-path audit. Adds §12
framing the deployed lease-expiry cascade as three independent layers:
- Layer A: in-process lock convoy → #5234 (open); write-behind composes on the
  post-#5234 CAS path.
- Layer B: pool starvation — DEFAULT_POSTGRES_POOL_MAX_SIZE=2 shared across all
  Postgres FS I/O. Notes the existing 30s checkout guard (closes the infinite
  hang) but flags pool-too-small + checkout(30s)>apply(15s); cheap mitigations
  (raise pool, reserve a critical connection, align checkout<apply) prior to and
  complementary with write-behind.
- Layer C: the synchronous hot-path write map (events / governor / thread-append
  / lease / memory) with the write-behind-vs-batch-coalesce distinction. Events
  are the top batch-coalesce target (highest churn, O(1) INSERT no CAS); must
  stay DURABLE (source of truth), per-step flush to preserve live SSE. Memory
  stays synchronous (FTS-only, no embedding write; agent-initiated, low churn).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve PR #5234 review comments across the shared cas_update helper
and its consumers:

- cas: replace SystemTime-based jitter (collapses to 0 on coarse clocks)
  with a RandomState-seeded value; drop the now-unnecessary
  too_many_arguments allow (helper has 6 args, below clippy threshold).
- cas: add explicit CasApply::no_op so absent-record no-ops are signalled
  directly; make the contract doc true. Deletes the BridgeError::NoOp
  "success-through-an-error-channel" hack in ironclaw_turns (E is now
  TurnError directly) and simplifies map_cas_error.
- run_state: fix a TOCTOU race in FilesystemApprovalRequestStore::
  discard_pending (Codex P1). The non-atomic get->check->delete could
  remove a record an approve()/deny() CAS had just transitioned. Route
  discard through cas_update as a version-checked tombstone transition
  (new ApprovalStatus::Discarded); get/records_for_scope filter it so it
  still reads as gone. Adds a no-clobber regression test.
- turns/secrets: correct inaccurate CAS no-op comment; replace an
  unreachable!() panic in map_cas_error with a typed Unavailable error.
- add a deterministic CasUpdateError::Timeout regression test (paused
  Tokio time + hanging backend).
- docs: restore the feature-isolation cargo check matrix and full clippy
  gate in the migration plan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/ironclaw_product_workflow/src/approval_interaction/service.rs (1)

158-164: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Classify discarded approvals as stale before the persistent branch.

For persistent == true, ApprovalStatus::Discarded hits Line 163 and returns AlwaysAllowUnsupported, so the new stale arm at Line 199 is unreachable for that path.

Proposed fix
-        if matches!(status, ApprovalStatus::Denied | ApprovalStatus::Expired) {
+        if matches!(
+            status,
+            ApprovalStatus::Denied | ApprovalStatus::Expired | ApprovalStatus::Discarded
+        ) {

As per coding guidelines, WebUI-facing facade errors must expose the stable sanitized taxonomy for blocked approval/auth/resource and stale/conflict states.

Also applies to: 199-199

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs` around
lines 158 - 164, The approval gating in
approval_interaction::service::ApprovalInteractionService should classify
ApprovalStatus::Discarded as stale before the persistent-only unsupported
branch. Update the early status checks around the approval_rejected flow so the
stale rejection path is taken for Discarded regardless of persistent, and ensure
the later stale arm remains reachable with the sanitized StaleGate taxonomy
instead of falling through to AlwaysAllowUnsupported.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_filesystem/src/cas.rs`:
- Around line 102-114: The no-op early returns in cas_update can return a
snapshot that is no longer current, which then gets cached by
filesystem_store::apply_from_snapshot and similar callers. Update cas_update so
the equality and explicit no_op branches either revalidate the record/version
before returning or avoid exposing a cacheable “final” snapshot from those
paths; use the cas_update, CasApply::new, and CasApply::no_op branches as the
main touchpoints. Also adjust the surrounding contract/comments so they only
promise what is actually enforced after concurrent writes.

In `@crates/ironclaw_run_state/tests/run_state_contract.rs`:
- Around line 491-510: The regression test in
filesystem_discard_does_not_clobber_resolved_approval is still sequential and
does not force the TOCTOU interleaving it describes. Update this test to use a
store/filesystem harness that can pause discard_pending after it reads a Pending
record, then let approve run on the same request before discard_pending resumes
its write/CAS path. Use FilesystemApprovalRequestStore and
discard_pending/approve to verify the resolved approval is preserved and the
terminal record is not clobbered.

In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 141-145: The quality gate currently uses cargo fmt --all, which
reformats files instead of validating CI-style formatting; update the checklist
entry in the plan to use the repo’s check-mode formatting command instead. Make
this change in the validation list alongside the other cargo check/clippy steps
so the migration plan matches the “Before You Open a PR” requirement and remains
authoritative.

---

Outside diff comments:
In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs`:
- Around line 158-164: The approval gating in
approval_interaction::service::ApprovalInteractionService should classify
ApprovalStatus::Discarded as stale before the persistent-only unsupported
branch. Update the early status checks around the approval_rejected flow so the
stale rejection path is taken for Discarded regardless of persistent, and ensure
the later stale arm remains reachable with the sanitized StaleGate taxonomy
instead of falling through to AlwaysAllowUnsupported.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: e8a5fca8-3068-48d7-92df-53d101912fa5

📥 Commits

Reviewing files that changed from the base of the PR and between 3223540 and 0eaf69d.

📒 Files selected for processing (10)
  • crates/ironclaw_capabilities/src/helpers.rs
  • crates/ironclaw_filesystem/Cargo.toml
  • crates/ironclaw_filesystem/src/cas.rs
  • crates/ironclaw_filesystem/src/cas/tests.rs
  • crates/ironclaw_product_workflow/src/approval_interaction/service.rs
  • crates/ironclaw_run_state/src/lib.rs
  • crates/ironclaw_run_state/tests/run_state_contract.rs
  • crates/ironclaw_secrets/src/filesystem_store.rs
  • crates/ironclaw_turns/src/filesystem_store.rs
  • docs/plans/2026-06-25-cas-migration.md

Comment thread crates/ironclaw_filesystem/src/cas.rs
Comment thread crates/ironclaw_run_state/tests/run_state_contract.rs
Comment thread docs/plans/2026-06-25-cas-migration.md Outdated
Reconcile the CAS-migration work with main, notably PR #5232 (durable
runner-lease sidecar) which independently rewrote
ironclaw_turns/filesystem_store.rs — splitting it into io/profile_resolver/
projection/runner_lease submodules and adding a per-run lease sidecar.

Resolution of crates/ironclaw_turns/src/filesystem_store.rs:
- Keep main's module split + runner-lease sidecar feature.
- Keep this PR's routing of the turn-state snapshot RMW through the shared
  ironclaw_filesystem::cas_update helper (no inline CAS loop, no BridgeError;
  E is TurnError directly; CasApply::no_op for the inert-snapshot no-op).
- The cas_update apply callback applies the runner-lease overlay per retry and
  uses the OVERLAID snapshot as the no-op baseline, matching main's pre-merge
  `new_snapshot == old_snapshot` semantics so an active lease overlay does not
  force a spurious write each apply. Covered by a new regression test
  (filesystem_turn_state_store_no_op_under_active_lease_overlay_does_not_rewrite_snapshot).
- service.rs: add ApprovalStatus::Discarded to the early stale-gate guard for
  consistency with the exhaustive match arms.

Known scoped exception, tracked in #5274: the runner-lease sidecar still uses a
local put_with_cas + cas_retry_backoff loop rather than cas_update. Documented
in crates/ironclaw_filesystem/CLAUDE.md; the main snapshot RMW does go through
cas_update.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 25, 2026 23:30 Destroyed
@henrypark133

Copy link
Copy Markdown
Collaborator Author

Addressed the Codex P1 (and CodeRabbit's matching outside-diff note) on discard_pending racing approve()/deny().

Fixed in 0eaf69d: the non-atomic get → check Pending → delete is replaced with a version-checked CAS transition through cas_update to a new ApprovalStatus::Discarded tombstone (mirrors update_status). If an approve()/deny() wins the CAS first, the discard's put hits VersionMismatch, re-reads, sees a non-Pending status, and returns ApprovalNotPending — so a resolved approval is never clobbered. get and records_for_scope filter Discarded, so a discarded request still reads as gone (contract unchanged). The in-memory store was left as-is — its get→check→remove is already atomic under one mutex.

Regression test added: filesystem_discard_does_not_clobber_resolved_approval (approve, then assert discard_pending returns ApprovalNotPending and the record stays Approved).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/ironclaw_product_workflow/src/approval_interaction/service.rs (2)

312-332: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Use debug! for these fallback diagnostics.

These new warn! calls are internal diagnostics on a best-effort path. In this repo, warn!/info! from runtime flows are reserved out because they corrupt the interactive REPL/TUI display. As per path instructions, REPL/TUI logging: info!/warn! corrupt the terminal UI — internal diagnostics use debug!.

Suggested diff
-            tracing::warn!(
+            tracing::debug!(
                 error = %error,
                 capability_id = %policy.key.capability_id,
                 action = ?policy.key.action,
                 "persistent approval policy write failed after approval resolution"
             );
@@
-            tracing::warn!(
+            tracing::debug!(
                 error = %error,
                 capability_id = %override_key.capability_id,
                 "tool permission override clear failed after persistent approval"
             );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs` around
lines 312 - 332, These fallback diagnostics in approval_interaction::service,
specifically the tracing calls inside the persistent_policies.allow and
tool_permission_overrides.clear error paths, should use debug! instead of warn!
because runtime warn/info logging can disrupt the REPL/TUI. Update both logging
sites to debug! while keeping the same error and capability context so the
internal best-effort failure reporting remains available without corrupting
terminal UI.

Source: Path instructions


306-333: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Route persistent-approval side effects through the approval layer.

This service now calls PersistentApprovalPolicyStore::allow() and ToolPermissionOverrideStore::clear() directly from ironclaw_product_workflow. That bypasses the approval boundary and leaves the durable allow + override-clear sequence owned by product code instead of the approval layer that already owns resolution semantics. Move this behind ApprovalResolutionPort (or a dedicated approval-layer port) and invoke that here. As per path instructions, Approve/deny decisions must go through ApprovalResolutionPort and TurnCoordinator; product/WebUI code must not directly execute tools or mutate approval stores ad hoc.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs` around
lines 306 - 333, Route the durable approval side effects out of
approval_interaction::service::persist_allow_policy and behind the approval
boundary instead of calling PersistentApprovalPolicyStore::allow() and
ToolPermissionOverrideStore::clear() directly. Introduce or use an
ApprovalResolutionPort-owned method that performs the allow-then-clear sequence,
and have persist_allow_policy delegate to that port so the approval layer
remains the single owner of resolution semantics and store mutation.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/ironclaw_product_workflow/src/approval_interaction/service.rs`:
- Around line 312-332: These fallback diagnostics in
approval_interaction::service, specifically the tracing calls inside the
persistent_policies.allow and tool_permission_overrides.clear error paths,
should use debug! instead of warn! because runtime warn/info logging can disrupt
the REPL/TUI. Update both logging sites to debug! while keeping the same error
and capability context so the internal best-effort failure reporting remains
available without corrupting terminal UI.
- Around line 306-333: Route the durable approval side effects out of
approval_interaction::service::persist_allow_policy and behind the approval
boundary instead of calling PersistentApprovalPolicyStore::allow() and
ToolPermissionOverrideStore::clear() directly. Introduce or use an
ApprovalResolutionPort-owned method that performs the allow-then-clear sequence,
and have persist_allow_policy delegate to that port so the approval layer
remains the single owner of resolution semantics and store mutation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: a098d8a1-20cb-419f-920f-fe4f83865273

📥 Commits

Reviewing files that changed from the base of the PR and between 0eaf69d and 2097c6d.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock, !**/Cargo.lock
📒 Files selected for processing (5)
  • crates/ironclaw_filesystem/CLAUDE.md
  • crates/ironclaw_product_workflow/src/approval_interaction/service.rs
  • crates/ironclaw_turns/src/filesystem_store.rs
  • crates/ironclaw_turns/src/filesystem_store/tests.rs
  • crates/ironclaw_turns/tests/filesystem_turn_state_contract.rs
💤 Files with no reviewable changes (3)
  • crates/ironclaw_turns/src/filesystem_store/tests.rs
  • crates/ironclaw_turns/tests/filesystem_turn_state_contract.rs
  • crates/ironclaw_turns/src/filesystem_store.rs

@henrypark133 henrypark133 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (multi-agent)

Intent: Replace per-record filesystem mutexes with a shared CAS update helper across Reborn stores to eliminate lock convoys and related leaks.

Stats: 6 findings kept (from 8 raw, 6 after filtering/dedup) across 3 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 1. Filtered: 2 low-signal local-pattern doc notes.

All surviving findings are Medium severity, so I am leaving this as a comment review rather than requesting changes.

Security

  1. Medium Discarded approvals now keep the full request payload on disk (crates/ironclaw_run_state/src/lib.rs:947-956, confidence 88) — anchor: crates/ironclaw_run_state/src/lib.rs:947
    discard_pending now rewrites the filesystem record as Discarded instead of deleting it, but it keeps the full original ApprovalRequest body in the tombstone. That means a cancelled approval's action, reason, requester, and fingerprint remain recoverable from disk/backups even though public APIs hide discarded records. This is a data-minimization regression on an approval control-plane record.
    Fix: Store only the minimal tombstone data needed to reserve the request ID, or keep the discarded-ID reservation in a compact side index instead of retaining the full approval payload.

Tests

  1. Medium Missing test for op-time unsupported-write fail-closed path (crates/ironclaw_filesystem/src/cas.rs:337-340, confidence 95) — anchor: crates/ironclaw_filesystem/src/cas.rs:337
    The helper covers the pre-flight non-CAS rejection, but the second fail-closed path is untested: an unknown/default-capability backend can read successfully and then return FilesystemError::Unsupported { operation: WriteFile } from put. That branch maps to CasUpdateError::CasUnsupported and should be protected because it is the composite-router fallback path.
    Fix: Add tests::cas::unsupported_write_file_maps_to_cas_unsupported using a default-capabilities backend whose read succeeds and whose CAS write returns Unsupported(WriteFile).

Performance / Concurrency

  1. Medium CAS helper clones the full snapshot on every retry (crates/ironclaw_filesystem/src/cas.rs:299-307, confidence 86) — anchor: crates/ironclaw_filesystem/src/cas.rs:306
    cas_update clones the decoded snapshot before every apply, and several migrated call sites clone again when building CasApply::new(...). For large snapshots such as turn state, each write now copies the whole record at least twice, and contention multiplies that cost across the retry loop. That is avoidable churn on the hot persistence path this PR is trying to improve.
    Fix: Pass apply a borrowed snapshot or split the helper API so the helper retains one owned copy for equality/write checks while call sites clone only when they truly need ownership.

Conventions

  1. Medium Filesystem discard now blocks ID reuse, but the in-memory store does not (crates/ironclaw_run_state/src/lib.rs:947-956, confidence 82) — anchor: crates/ironclaw_run_state/tests/run_state_contract.rs:459 (no diff position — body only)
    The filesystem store now writes a Discarded tombstone, so a later save_pending with the same request ID is rejected even though get and records_for_scope hide the record. The in-memory store still removes the record outright, so tests and early wiring can accept a reused approval ID that production rejects. That diverges the store contract introduced by this PR.
    Fix: Make InMemoryApprovalRequestStore::discard_pending preserve a discarded tombstone, or separately track discarded IDs so save_pending rejects reused IDs in both backends.

Maintainability

  1. Medium Turn-state apply has become a 1k-line orchestration knot (crates/ironclaw_turns/src/filesystem_store.rs:381-533, confidence 95) — anchor: crates/ironclaw_turns/src/filesystem_store.rs:381
    This refactor pushes FilesystemTurnStateStore from 961 to 1051 lines, and the new apply flow now mixes cache invalidation, runner-lease overlay, transient store construction, FnMut adaptation, CAS no-op detection, timeout policy, and cache update/rollback in one method. That is a lot of state to hold in one control flow, especially in the turn-state persistence path.
    Fix: Split the CAS-specific orchestration into smaller helpers, or move runner-lease overlay/cache bookkeeping behind a focused helper so apply coordinates the transition at a higher level.
  2. Medium The apply timeout is now defined in two places (crates/ironclaw_turns/src/filesystem_store.rs:77-77, confidence 86) — anchor: crates/ironclaw_turns/src/filesystem_store.rs:77
    The turn store keeps its own 15-second FILESYSTEM_APPLY_TIMEOUT, while the shared CAS helper now owns the same 15-second loop timeout. The outer timeout comment says this may be shorter for tests, but the default value is still duplicated, so a future timeout-policy change can drift the turn store away from the helper contract.
    Fix: Use the shared ironclaw_filesystem::FILESYSTEM_APPLY_TIMEOUT as the turn-store default, or remove the outer default when the helper deadline is the only intended policy.

status: record.status,
});
}
// Write a Discarded tombstone so the file still exists (preventing

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Discarded approvals now keep the full request payload on disk.

discard_pending now rewrites the filesystem record as Discarded instead of deleting it, but it keeps the full original ApprovalRequest body in the tombstone. That means a cancelled approval's action, reason, requester, and fingerprint remain recoverable from disk/backups even though public APIs hide discarded records. This is a data-minimization regression on an approval control-plane record.

Fix: Store only the minimal tombstone data needed to reserve the request ID, or keep the discarded-ID reservation in a compact side index instead of retaining the full approval payload.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred (consistency, not a discard-specific bug). ApprovalRecord.request is non-optional and Approved/Denied records already retain the identical full payload on disk — the Discarded tombstone is consistent with every other terminal state. Singling out discard for minimization would be the inconsistency. If approval-payload retention matters it's a cross-cutting policy (pruning/redaction of all terminal approval records) plus a schema change (optional/redacted request), which is out of scope for this PR. Tracking as a retention follow-up.

cas_retry_backoff(attempt).await;
}
// 5b. Backend cannot CAS-write — fail closed (no blind overwrite).
Err(FilesystemError::Unsupported {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Missing test for op-time unsupported-write fail-closed path.

The helper covers the pre-flight non-CAS rejection, but the second fail-closed path is untested: an unknown/default-capability backend can read successfully and then return FilesystemError::Unsupported { operation: WriteFile } from put. That branch maps to CasUpdateError::CasUnsupported and should be protected because it is the composite-router fallback path.

Fix: Add tests::cas::unsupported_write_file_maps_to_cas_unsupported using a default-capabilities backend whose read succeeds and whose CAS write returns Unsupported(WriteFile).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 578e971 — added unsupported_write_file_maps_to_cas_unsupported: a default/unknown-capability backend whose get succeeds and whose put returns Unsupported{WriteFile}, asserting cas_update maps it to CasUpdateError::CasUnsupported. This exercises the op-time fallback path, distinct from the already-tested pre-flight rejection.

snapshot,
outcome,
write,
} = apply(current.clone())

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — CAS helper clones the full snapshot on every retry.

cas_update clones the decoded snapshot before every apply, and several migrated call sites clone again when building CasApply::new(...). For large snapshots such as turn state, each write now copies the whole record at least twice, and contention multiplies that cost across the retry loop. That is avoidable churn on the hot persistence path this PR is trying to improve.

Fix: Pass apply a borrowed snapshot or split the helper API so the helper retains one owned copy for equality/write checks while call sites clone only when they truly need ownership.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred (low-value perf). Confirmed two in-memory clones per attempt: cas_update clones current (needed — retained for the equality check after apply consumes its copy) and the turns call site clones new_snapshot (threaded into outcome to populate the read cache). Both are struct copies, not IO, and the loop only re-clones on VersionMismatch (rare). Eliminating them requires an API change (return the final snapshot from cas_update, or hash-based equality) for marginal gain — premature without a profile showing turn-state writes are hot. Will revisit with a benchmark.

// there is no lock contention in practice.
let apply = Arc::new(Mutex::new(apply));

let cas_future = cas_update(

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Turn-state apply has become a 1k-line orchestration knot.

This refactor pushes FilesystemTurnStateStore from 961 to 1051 lines, and the new apply flow now mixes cache invalidation, runner-lease overlay, transient store construction, FnMut adaptation, CAS no-op detection, timeout policy, and cache update/rollback in one method. That is a lot of state to hold in one control flow, especially in the turn-state persistence path.

Fix: Split the CAS-specific orchestration into smaller helpers, or move runner-lease overlay/cache bookkeeping behind a focused helper so apply coordinates the transition at a higher level.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed it's doing a lot, deferred to #5274. The clean move is extracting the per-retry transform (overlay + store build + no-op detection) into a named helper so apply reads as orchestration. Follow-up #5274 (migrate the runner-lease sidecar onto cas_update) already rewrites this exact apply/CAS path, so the decomposition lands there — once — rather than re-churning a just-merged file twice. Added to #5274's scope. (Note: most of the 961→1051 growth is main's own module split landing via the merge; net the file is smaller than this branch's pre-merge 1455.)

/// unrelated callers behind one wedged operation.
pub const FILESYSTEM_CAS_RETRIES: usize = 32;
/// Deadline for the entire read-modify-write loop (including all retries).
pub const FILESYSTEM_APPLY_TIMEOUT: Duration = Duration::from_secs(15);

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — The apply timeout is now defined in two places.

The turn store keeps its own 15-second FILESYSTEM_APPLY_TIMEOUT, while the shared CAS helper now owns the same 15-second loop timeout. The outer timeout comment says this may be shorter for tests, but the default value is still duplicated, so a future timeout-policy change can drift the turn store away from the helper contract.

Fix: Use the shared ironclaw_filesystem::FILESYSTEM_APPLY_TIMEOUT as the turn-store default, or remove the outer default when the helper deadline is the only intended policy.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 578e971 — the turn store now uses the shared ironclaw_filesystem::FILESYSTEM_APPLY_TIMEOUT instead of a duplicated local constant; the per-store override via with_apply_timeout (used in tests) is unchanged.

- cas/tests.rs: add `unsupported_write_file_maps_to_cas_unsupported` covering
  the op-time fail-closed path (default-capability backend reads OK, then `put`
  returns Unsupported{WriteFile} → CasUpdateError::CasUnsupported) — previously
  only the pre-flight non-CAS rejection was tested.
- run_state tests: replace the sequential discard regression with a
  deterministic TOCTOU harness (RootFilesystem decorator injects a racing
  approve() during discard's read→write window) so the CAS-retry-after-version-
  bump path is actually exercised; discard loses the race and returns
  ApprovalNotPending without clobbering the resolved record.
- turns/filesystem_store.rs: use the shared ironclaw_filesystem::
  FILESYSTEM_APPLY_TIMEOUT instead of a duplicated local 15s constant.
- docs/plans: quality gate uses check-mode `cargo fmt --all -- --check`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 03:42 Destroyed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/plans/2026-06-25-cas-migration.md (1)

141-146: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add cargo build to the quality gate.

Line 141 updates the checklist, but the repo’s required pre-PR validation still includes cargo build. Omitting it leaves this migration plan weaker than the documented submission gate.

As per coding guidelines, "Before You Open a PR" requires cargo build.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/2026-06-25-cas-migration.md` around lines 141 - 146, The quality
gate checklist is missing the required build validation, so update the plan’s
validation list to include cargo build alongside the existing cargo
fmt/check/clippy/test commands. Make the change in the checklist section that
enumerates the verification steps so the migration plan matches the repo’s
pre-PR submission gate.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 511-582: The test only checks the საბოლოary
CasUpdateError::CasUnsupported, so it can still pass even if cas_update fails
earlier in the pre-flight capability check and never reaches
UnsupportedWriteBackend::put. Update the UnsupportedWriteBackend test helper to
record when put is invoked, and in
unsupported_write_file_maps_to_cas_unsupported assert both that cas_update
returns CasUpdateError::CasUnsupported and that put was actually called, so the
op-time fallback path is proven.

---

Outside diff comments:
In `@docs/plans/2026-06-25-cas-migration.md`:
- Around line 141-146: The quality gate checklist is missing the required build
validation, so update the plan’s validation list to include cargo build
alongside the existing cargo fmt/check/clippy/test commands. Make the change in
the checklist section that enumerates the verification steps so the migration
plan matches the repo’s pre-PR submission gate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 947d1e05-ffb2-4cfc-ac10-331d3eba110a

📥 Commits

Reviewing files that changed from the base of the PR and between 2097c6d and 578e971.

📒 Files selected for processing (5)
  • crates/ironclaw_filesystem/src/cas/tests.rs
  • crates/ironclaw_product_workflow/src/approval_interaction/service.rs
  • crates/ironclaw_run_state/tests/run_state_contract.rs
  • crates/ironclaw_turns/src/filesystem_store.rs
  • docs/plans/2026-06-25-cas-migration.md

Comment thread crates/ironclaw_filesystem/src/cas/tests.rs Outdated

@henrypark133 henrypark133 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (multi-agent)

Intent: Replace per-record filesystem mutex convoys with a shared CAS update helper and migrate all Reborn stores to it.

Stats: 4 findings kept (from 5 raw, 4 after filtering/dedup) across 2 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.

All surviving findings are Medium/Low, so this is a comment review rather than a request for changes. I did not repost prior threads that already have fixed replies or sound deferrals; the turn-state apply-structure concern is covered by the existing #5274 deferral.

Tests

  1. Medium cas_update lacks a generic backend-error regression test (crates/ironclaw_filesystem/src/cas.rs:330-341, confidence 78) — anchor: crates/ironclaw_filesystem/src/cas.rs:341
    The helper has coverage for first write, no-op, contention, timeout, and unsupported-write, but not the fallback Err(error) => CasUpdateError::Backend(error) branch on a normal write failure. A backend returning a non-Unsupported FilesystemError from put would still be unpinned here.
    Fix: Add tests::cas::backend_put_error_maps_to_backend covering a successful read followed by a generic backend error from put.
  2. Low Discarded approval status is untested (crates/ironclaw_capabilities/src/helpers.rs:273-279, confidence 84) — anchor: crates/ironclaw_capabilities/src/helpers.rs:273
    The new ApprovalStatus::Discarded arm has no direct regression test. Nothing asserts that discarded approvals map to ApprovalDiscarded, so this one-line branch could drift silently.
    Fix: Add tests::helpers::approval_not_approved_error_kind_maps_discarded_status covering ApprovalStatus::Discarded -> "ApprovalDiscarded".

Performance

  1. Low Last retry still pays a backoff sleep (crates/ironclaw_filesystem/src/cas.rs:333-334, confidence 89) — anchor: crates/ironclaw_filesystem/src/cas.rs:333
    The loop always awaits cas_retry_backoff(attempt) on VersionMismatch, including the final permitted attempt. That adds an avoidable 2-50ms of tail latency to every exhausted CAS call on the shared hot path, with no chance of success because no retry remains.
    Fix: Skip the sleep when attempt + 1 == FILESYSTEM_CAS_RETRIES and return RetriesExhausted immediately.

Local Patterns

  1. Low Fix the equality-path example in the public API docs (crates/ironclaw_filesystem/src/cas.rs:105-108, confidence 91) — anchor: crates/ironclaw_filesystem/src/cas.rs:105
    The public CasApply docs describe the unchanged-snapshot fast path as Some(existing) == snapshot, but snapshot is an S while current is the Option<S>. As written, the example is type-mismatched and makes the two no-op modes harder to distinguish for store authors reading the API surface.
    Fix: Rewrite the sentence in terms of the actual comparison, for example current is Some(existing) and the returned snapshot equals existing.

operation: FilesystemOperation::WriteFile,
..
}) => return Err(CasUpdateError::CasUnsupported),
Err(error) => return Err(CasUpdateError::Backend(error)),

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — cas_update lacks a generic backend-error regression test.

The helper has coverage for first write, no-op, contention, timeout, and unsupported-write, but not the fallback Err(error) => CasUpdateError::Backend(error) branch on a normal write failure. A backend returning a non-Unsupported FilesystemError from put would still be unpinned here.

Fix: Add tests::cas::backend_put_error_maps_to_backend covering a successful read followed by a generic backend error from put.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4b061dd — added backend_put_error_maps_to_backend: a CAS-capable backend whose get succeeds and whose put returns a generic (non-VersionMismatch, non-Unsupported{WriteFile}) FilesystemError, asserting cas_update maps it to CasUpdateError::Backend(_). Covers the catch-all arm at cas.rs:341.

Comment thread crates/ironclaw_filesystem/src/cas.rs Outdated
Ok(_) => return Ok(outcome),
// 5a. Lost the CAS race — re-read and retry with backoff.
Err(FilesystemError::VersionMismatch { .. }) => {
cas_retry_backoff(attempt).await;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low — Last retry still pays a backoff sleep.

The loop always awaits cas_retry_backoff(attempt) on VersionMismatch, including the final permitted attempt. That adds an avoidable 2-50ms of tail latency to every exhausted CAS call on the shared hot path, with no chance of success because no retry remains.

Fix: Skip the sleep when attempt + 1 == FILESYSTEM_CAS_RETRIES and return RetriesExhausted immediately.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4b061dd — the VersionMismatch arm now guards the sleep with if attempt + 1 < FILESYSTEM_CAS_RETRIES, so the final exhausted attempt returns RetriesExhausted immediately without the 2-50ms backoff. Retry count unchanged.

ApprovalStatus::Approved => "ApprovalApproved",
ApprovalStatus::Denied => "ApprovalDenied",
ApprovalStatus::Expired => "ApprovalExpired",
ApprovalStatus::Discarded => "ApprovalDiscarded",

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low — Discarded approval status is untested.

The new ApprovalStatus::Discarded arm has no direct regression test. Nothing asserts that discarded approvals map to ApprovalDiscarded, so this one-line branch could drift silently.

Fix: Add tests::helpers::approval_not_approved_error_kind_maps_discarded_status covering ApprovalStatus::Discarded -> "ApprovalDiscarded".

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4b061dd — added approval_not_approved_error_kind_maps_discarded_status asserting ApprovalStatus::Discarded -> "ApprovalDiscarded".

/// Two no-op signals are available, covering the two cases where `apply` should
/// skip the write entirely:
///
/// 1. **Existing record, unchanged snapshot** — return `CasApply::new` with

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low — Fix the equality-path example in the public API docs.

The public CasApply docs describe the unchanged-snapshot fast path as Some(existing) == snapshot, but snapshot is an S while current is the Option<S>. As written, the example is type-mismatched and makes the two no-op modes harder to distinguish for store authors reading the API surface.

Fix: Rewrite the sentence in terms of the actual comparison, for example current is Some(existing) and the returned snapshot equals existing.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4b061dd — the CasApply doc example no longer mismatches types; it now describes the actual check (current is Some(existing) and the returned snapshot equals existing), matching matches!(&current, Some(e) if *e == snapshot) and distinguishing it from the explicit CasApply::no_op path.

- cas/tests.rs: assert `UnsupportedWriteBackend::put` was actually invoked in
  `unsupported_write_file_maps_to_cas_unsupported` (proves the op-time fallback
  path, not just the final enum which pre-flight also yields).
- cas/tests.rs: add `backend_put_error_maps_to_backend` covering the generic
  `put` failure → `CasUpdateError::Backend` catch-all arm.
- cas.rs: skip the jittered backoff sleep on the final CAS attempt
  (`attempt + 1 < FILESYSTEM_CAS_RETRIES`) — no retry remains, so it was pure
  tail latency on the shared hot path.
- cas.rs: fix the type-mismatched equality example in the `CasApply` docs
  (`current` is `Some(existing)` and the returned snapshot equals `existing`).
- capabilities/helpers.rs: regression test for
  `ApprovalStatus::Discarded -> "ApprovalDiscarded"`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 04:21 Destroyed

@henrypark133 henrypark133 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (multi-agent)

Intent: Remove per-record filesystem mutex convoys by migrating Reborn filesystem-backed stores to a shared CAS update helper.

Stats: 4 findings (from 8 reviewer lanes; 1 High, 3 Medium) across 1 file. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 0.

Prior-thread context was considered. I did not repost the already-addressed items or the soundly deferred follow-ups around Send bound creep, BackendCapabilities::default() as an explicit-design cleanup, approval retention/tombstone policy, clone-cost benchmarking, or turn-state decomposition in #5274.

Bugs

  1. High Byte-only backends still slip through on first write (crates/ironclaw_filesystem/src/cas.rs:256-347, confidence 88) — anchor: crates/ironclaw_filesystem/src/cas.rs:260
    The fail-closed CAS gate can bypass default-capability composite roots and then successfully first-write JSON byte snapshots on local/byte-only mounts via CasExpectation::Absent.

Tests

  1. Medium Missing test for backend read failure propagation (crates/ironclaw_filesystem/src/cas.rs:292-298, confidence 93) — anchor: crates/ironclaw_filesystem/src/cas.rs:298
    The helper's get error branch should be pinned as CasUpdateError::Backend.
  2. Medium Missing test for malformed snapshot decode failure (crates/ironclaw_filesystem/src/cas.rs:293-295, confidence 98) — anchor: crates/ironclaw_filesystem/src/cas.rs:294
    Malformed persisted bytes should be pinned as CasUpdateError::Apply at the helper boundary.
  3. Medium Missing test for encode failure propagation (crates/ironclaw_filesystem/src/cas.rs:325-327, confidence 98) — anchor: crates/ironclaw_filesystem/src/cas.rs:327
    Encoder failures should be pinned as CasUpdateError::Apply without writing.

// *known* shape; the composite router advertises the empty default and
// defers to the op-time check below. Either way we never fall back to
// `CasExpectation::Any`.
let capabilities = filesystem.capabilities();

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — Byte-only backends still slip through on first write.

The fail-closed gate only rejects backends that advertise a non-default capability shape or that return Unsupported(WriteFile) from put. A composite filesystem reports BackendCapabilities::default() before path resolution, then routes the write to the mounted backend. LocalFilesystem accepts CasExpectation::Absent for plain byte entries, and CAS snapshots are JSON byte entries, so the first snapshot write on a local/byte-only mount can succeed instead of surfacing CasUnsupported. Later Version(...) writes fail, but the helper has already violated its no-CAS fail-closed contract for fresh records.

Fix: Do not infer CAS support from the root composite capabilities alone. Gate against path-resolved mount capabilities, or reject unknown/default roots before issuing any write so Absent writes cannot succeed on non-CAS mounts.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0d4bc1c — made all CAS-store encoders record-shaped (entry.kind = Some(RecordKind)), so LocalFilesystem's existing record-shaped rejection fires fail-closed uniformly on the Absent first-write too (turns was already protected; run_state/secrets/resources/threads were not). Self-enforcing at the data-model level, no new helper branches. Severity is really Medium — Absent is create-if-absent and all updates fail-closed, so there is no blind overwrite / lost update; only creation succeeded then updates failed loudly. New record_shaped_first_write_fails_closed_on_byte_only_backend + a store-level fail-closed test over a real byte-only LocalFilesystem mount; verified across the postgres/libsql/all-features matrix.

(Some(decoded), Some(versioned.version))
}
Ok(None) => (None, None),
Err(error) => return Err(CasUpdateError::Backend(error)),

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Missing test for backend read failure propagation.

The filesystem.get(...).await error arm now returns CasUpdateError::Backend, but the new CAS helper tests never inject a read failure. The suite covers contention, capability gating, timeout, and put failures, but not the branch that should surface a backend failure before a snapshot can be loaded.

Fix: Add tests::cas::get_error_is_returned_as_backend using a backend whose get returns a non-NotFound FilesystemError and assert cas_update returns CasUpdateError::Backend(_).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0d4bc1c — added backend_get_error_maps_to_backend pinning a get failure to CasUpdateError::Backend.

// committed version, never a torn one.
let (current, version) = match filesystem.get(scope, path).await {
Ok(Some(versioned)) => {
let decoded = decode(&versioned.entry.body).map_err(CasUpdateError::Apply)?;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Missing test for malformed snapshot decode failure.

The decode(&versioned.entry.body) failure path maps to CasUpdateError::Apply, but no test drives cas_update with invalid bytes in an existing record. That helper boundary should be pinned directly because malformed persisted snapshots are a distinct failure mode from apply-closure errors.

Fix: Add tests::cas::decode_error_is_returned_as_apply with an existing record containing malformed JSON and assert the decode error is returned through CasUpdateError::Apply.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0d4bc1c — added malformed_snapshot_decode_maps_to_apply pinning a decode failure to CasUpdateError::Apply.


// 4. Encode + CAS put. Absent → create-if-absent; existing → version
// precondition.
let entry = encode(&snapshot).map_err(CasUpdateError::Apply)?;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Missing test for encode failure propagation.

The encode(&snapshot) failure path also maps to CasUpdateError::Apply, but the current tests only use an encoder that succeeds. A serialization failure would therefore be unpinned at the shared helper boundary.

Fix: Add tests::cas::encode_error_is_returned_as_apply using an encoder that returns TestError after apply produces a changed snapshot, and assert CasUpdateError::Apply is returned without writing.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0d4bc1c — added encode_failure_maps_to_apply_without_write pinning an encode failure to CasUpdateError::Apply and asserting no put was attempted.

… tests

Multi-agent review (High): cas_update's fail-closed guarantee had a gap on the
first (absent) write. A default/unknown-capability backend skips the pre-flight
gate, and LocalFilesystem.put rejects only record-shaped entries
(entry.kind.is_some()) — so a BYTE-ONLY entry written with CasExpectation::Absent
was accepted, slipping past the op-time fail-closed check. ironclaw_turns already
encoded record-shaped entries (kind set) and was protected; run_state, secrets,
resources, and threads encoded byte-only entries and were not.

Fix (make the contract self-enforcing at the data-model level, not mount-config
dependent): set entry.kind on every CAS-store encoder so they are record-shaped
like turns. LocalFilesystem's existing record-shaped rejection then makes the
op-time fail-closed fire uniformly on the Absent path too — no new branches in
the shared helper. Reads are unaffected (body decode ignores kind); additive vs
existing kind=None rows.

- run_state: run-record + approval-record encoders set kind; store-level
  fail-closed regression test over a byte-only LocalFilesystem mount.
- secrets: secret / lease / account / session encoders set kind.
- resources: Snapshot trait gains RECORD_KIND; cas_snapshot encoder stamps it;
  ResourceGovernor + BudgetGate snapshots declare their kind.
- threads: session / message / summary / idempotency encoders set kind.
- cas/tests.rs: `record_shaped_first_write_fails_closed_on_byte_only_backend`
  proves the mechanism end-to-end (kind-gated put → CasUnsupported on first
  write), plus error-propagation tests: get-failure → Backend, decode-failure →
  Apply, encode-failure → Apply (no write).

Note (kind identifiers): RecordKind validates as `[A-Za-z_][A-Za-z0-9_]*`, so
snake_case kinds (matching turns' `turn_state_snapshot`), not kebab-case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 05:05 Destroyed
@henrypark133

Copy link
Copy Markdown
Collaborator Author

Addressed the latest multi-agent review (1 High + 3 Medium tests) in 0d4bc1c.

High — byte-only backends slip through on first write. Valid, and a touch broader than flagged: LocalFilesystem.put rejects only record-shaped entries (entry.kind.is_some()) before the CAS check, so a byte-only entry written with CasExpectation::Absent was accepted — slipping past op-time fail-closed on the first write. ironclaw_turns already encoded record-shaped (kind set) and was protected; run_state/secrets/resources/threads encoded byte-only entries and were not.
Severity is really Medium, not data-integrity: Absent is create-if-absent (concurrent creates → one winner + VersionMismatch for the rest → retry hits the Version path → fail-closed), and all updates fail-closed, so there is no blind overwrite / lost update — the gap was only that creation succeeded, then every update failed loudly.
Fix (self-enforcing at the data-model level rather than mount-config dependent): set entry.kind on every CAS-store encoder so they are record-shaped like turns. LocalFilesystem’s existing record-shaped rejection then fires uniformly on the Absent path too — no new branches in the shared helper, and it removes the turns-vs-rest inconsistency. Additive/back-compatible (reads ignore kind; old kind=None rows still decode). New record_shaped_first_write_fails_closed_on_byte_only_backend proves the mechanism end-to-end; a store-level fail-closed test over a real byte-only LocalFilesystem mount was added in run_state. Verified across the postgres/libsql/all-features matrix.

Medium ×3 — missing error-path tests. Added: read failure → CasUpdateError::Backend (backend_get_error_maps_to_backend); malformed decode → Apply (malformed_snapshot_decode_maps_to_apply); encode failure → Apply with no write attempted (encode_failure_maps_to_apply_without_write).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/ironclaw_threads/src/filesystem_service.rs (1)

1118-1125: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Validate the stored thread ID before returning an existing record.

Line 1120 only scope-checks existing; if the path contains a same-scope record for a different thread_id, ensure_thread returns the wrong thread as successfully ensured. Match read_thread_versioned’s identity check and reject mismatched IDs.

Proposed fix
-                        if existing.record.scope != scope {
+                        if existing.record.scope != scope || existing.record.thread_id != thread_id {
                             Err(SessionThreadError::ThreadScopeMismatch { thread_id })
                         } else {

As per coding guidelines, “Preserve tenant/user/agent/project/mission/thread scope on authority, state, memory, process, network, outbound, resource, and event records.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ironclaw_threads/src/filesystem_service.rs` around lines 1118 - 1125,
The existing-return path in ensure_thread only checks scope and can incorrectly
accept a record whose stored thread_id does not match the requested thread_id.
Update the Some(existing) branch in filesystem_service::ensure_thread to also
validate existing.record.thread_id against the requested thread_id, matching the
identity check used by read_thread_versioned, and return a thread-mismatch error
when the IDs differ before returning CasApply::new(existing.clone(),
existing.record).

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ironclaw_filesystem/src/cas/tests.rs`:
- Around line 975-995: The byte-only mock in KindGatedByteOnlyBackend::put only
verifies the entry shape and ignores the cas argument, so the test still passes
even if cas_update stops passing CasExpectation::Absent for first writes. Update
the mock to capture and assert the CAS precondition alongside path and entry,
and ensure the test exercises the absent-write case through cas_update so every
argument the production caller passes is validated.

---

Outside diff comments:
In `@crates/ironclaw_threads/src/filesystem_service.rs`:
- Around line 1118-1125: The existing-return path in ensure_thread only checks
scope and can incorrectly accept a record whose stored thread_id does not match
the requested thread_id. Update the Some(existing) branch in
filesystem_service::ensure_thread to also validate existing.record.thread_id
against the requested thread_id, matching the identity check used by
read_thread_versioned, and return a thread-mismatch error when the IDs differ
before returning CasApply::new(existing.clone(), existing.record).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 7c6c776b-3c08-4c47-8ace-e275948a1864

📥 Commits

Reviewing files that changed from the base of the PR and between 4b061dd and 0d4bc1c.

📒 Files selected for processing (8)
  • crates/ironclaw_filesystem/src/cas/tests.rs
  • crates/ironclaw_resources/src/cas_snapshot.rs
  • crates/ironclaw_resources/src/filesystem_store.rs
  • crates/ironclaw_resources/src/lib.rs
  • crates/ironclaw_run_state/src/lib.rs
  • crates/ironclaw_run_state/tests/run_state_contract.rs
  • crates/ironclaw_secrets/src/filesystem_store.rs
  • crates/ironclaw_threads/src/filesystem_service.rs

Comment thread crates/ironclaw_filesystem/src/cas/tests.rs
- cas/tests.rs: the byte-only mock's `put` now captures `cas` and asserts
  `CasExpectation::Absent`, so the fail-closed regression also pins that
  cas_update uses Absent for the first (absent) write — not just the
  record-shaped op-time rejection.
- threads/filesystem_service.rs: ensure_thread's existing-return branch now
  rejects a thread_id mismatch as well as a scope mismatch, matching
  read_thread_versioned (line 236) and the message-read guard. The thread
  path is thread_id-keyed so this is defense-in-depth parity rather than a
  reachable bug via the public API; bringing it in line with the canonical
  read path removes the inconsistency a reviewer flagged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 05:56 Destroyed
@henrypark133

Copy link
Copy Markdown
Collaborator Author

Re the outside-diff Major on ironclaw_threads/filesystem_service.rs ensure_thread (validate stored thread ID): addressed in 878a44c. The Some(existing) branch now rejects a thread_id mismatch as well as a scope mismatch, matching read_thread_versioned (line 236) and the message-read guard (line 407).
Worth noting it is defense-in-depth, not a reachable bug today: thread_record_path is {scope}/{thread_id}/thread.json — thread_id is a literal path segment, so a CAS read at that path can only return a record written there under the same thread_id. The change removes the inconsistency with the canonical read paths rather than fixing a live mis-identification, so no fabricated regression test was added (the mismatch is unreachable via the public API).

@henrypark133 henrypark133 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review (multi-agent)

Intent: Remove per-record filesystem lock convoys by migrating Reborn persistence stores to a shared CAS update helper.

Stats: 2 findings kept (from 6 raw final-head reviewer findings, 2 after live-thread/stale filtering) across 2 files. Reviewers run: security, bugs, performance, tests, conventions, local-patterns, maintainability, approach. Reviewers failed: none. Body-only: 1.

Prior-thread context was considered. I filtered out the repeated CAS helper findings for byte-only backends and get/decode/encode error-path tests because the live threads already mark them fixed on the current series and the final head contains the corresponding tests.

Tests

  1. Medium Cover consuming an already-expired lease (crates/ironclaw_secrets/src/filesystem_store.rs:498-504, confidence 86) — anchor: crates/ironclaw_secrets/src/filesystem_store.rs:498
    The new SecretLeaseStatus::Expired branch in FilesystemSecretStore::consume returns LeaseExpired directly when the lease record is already marked Expired, but there is no regression test for that already-expired path. A zero-TTL or pre-aged lease could start rewriting state or surfacing the wrong error without a failing test.

Local Patterns

  1. Low Drop the stale per-path lock description from the CAS snapshot docs (crates/ironclaw_resources/src/cas_snapshot.rs:8-10, confidence 95) — anchor: crates/ironclaw_resources/src/cas_snapshot.rs:8 (no diff position — body only)
    The module docs still say the shared snapshot stores use an in-process per-path async lock that serializes same-process writers, but this module now routes updates through CasSnapshotStore and the shared lock-free cas_update helper. That wording is now inaccurate and can mislead future maintainers about the actual concurrency model, especially when adding new stores.

SecretLeaseStatus::Revoked => {
Err(SecretStoreError::LeaseRevoked { lease_id })
}
SecretLeaseStatus::Expired => {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Cover consuming an already-expired lease.

The new SecretLeaseStatus::Expired branch in FilesystemSecretStore::consume returns LeaseExpired directly when the lease record is already marked Expired, but there is no regression test for that already-expired path. A zero-TTL or pre-aged lease could start rewriting state or surfacing the wrong error without a failing test.

Fix: Add a filesystem secret-store test that creates or marks an already-expired lease, calls consume, and asserts LeaseExpired is returned without issuing another write.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 805bd15 — added consume_already_expired_lease_returns_expired_without_rewrite. It forces a genuine stored Expired state (lease TTL = -1s → first consume promotes Active→Expired with one write → second consume hits the already_marked branch), asserts Err(SecretStoreError::LeaseExpired), and captures the InMemoryBackend record version before/after to prove the already-marked path issues NO second write (version unchanged).

… LocalFilesystem

CI regression from making the CAS-store encoders record-shaped (0d4bc1c):
LocalFilesystem.put rejects record-shaped entries (entry.kind.is_some()) with
Unsupported{WriteFile}, which cas_update maps to fail-closed CasUnsupported.
Several reborn test harnesses backed run-state / approval / session-thread
stores with a byte-only LocalFilesystem — a backend production never uses for
these stores (the filesystem guardrail states all production store mounts
resolve to CAS-capable db/in-memory backends; LocalFilesystem is structurally
unreachable). So the failures are the intended fail-closed contract correctly
rejecting a non-production backend, not a code regression.

Fix the harnesses, not the code: route the CAS stores through a shared
Arc<InMemoryBackend> (full BackendCapabilities, CAS-capable) which models the
production database-backed filesystem. The durable-restart test keeps its three
service-graph instances over one shared backend, preserving the
state-survives-restart semantics; LocalFilesystem still backs the byte-only
capability-lease / process services (which don't set RecordKind). No assertions
weakened, no tests ignored.

- crates/ironclaw_host_runtime/tests/reborn_durable_restart_integration.rs
- tests/support/reborn/session_thread.rs
- tests/support/reborn/harness.rs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 06:28 Destroyed
…stale CAS doc

- secrets: add `consume_already_expired_lease_returns_expired_without_rewrite`.
  The `SecretLeaseStatus::Expired` already-marked branch of `consume` returns
  LeaseExpired via the apply-error path with no write; the test forces a stored
  Expired state (TTL=-1 → first consume promotes to Expired → second consume
  hits the already-marked path) and asserts the backend record version does NOT
  bump (no second write).
- resources/cas_snapshot.rs: module doc no longer describes an in-process
  per-path async lock — it now routes through the lock-free shared cas_update
  helper (versioned CAS + retry on VersionMismatch + jittered backoff).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@railway-app railway-app Bot temporarily deployed to ironclaw-ci-preview / ironclaw-pr-5234 June 26, 2026 16:36 Destroyed
@henrypark133

Copy link
Copy Markdown
Collaborator Author

Re the Low (body-only) finding on ironclaw_resources/src/cas_snapshot.rs module docs (stale per-path lock description): fixed in 805bd15. The header no longer mentions an in-process per-path async lock serializing same-process writers — it now describes the actual model: lock-free optimistic concurrency via the shared cas_update helper (versioned compare-and-swap, retry on VersionMismatch with bounded jittered backoff). Verified no other stale lock/serializ mentions remain in the module docs (the remaining hits are accurate method/test notes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules scope: dependencies Dependency updates scope: docs Documentation size: XL 500+ changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant